---
title: Title
keywords: fastai
sidebar: home_sidebar
---

The first thing we need to do is download the data.

First of all, let's install the Kaggle API and configure it with our API credentials.
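For reference, a typical setup looks something like this (assuming you have already generated a `kaggle.json` token from your Kaggle account page; the download path is illustrative):

```shell
pip install kaggle
# Place the API token where the CLI expects it
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
# The CLI refuses tokens readable by other users
chmod 600 ~/.kaggle/kaggle.json
```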

Let's create a folder to download the competition files.

{% raw %}
mkdir data
{% endraw %}

Now let's download the files (make sure you accept the competition rules on Kaggle first, or you will not be able to access the data!).

{% raw %}
!kaggle competitions download birdsong-recognition -p data
Downloading birdsong-recognition.zip to data
100%|██████████████████████████████████████| 22.1G/22.1G [06:45<00:00, 58.5MB/s]
{% endraw %} {% raw %}
!ls data
birdsong-recognition.zip
{% endraw %} {% raw %}
%%capture
!unzip data/birdsong-recognition.zip -d data
{% endraw %} {% raw %}
!ls data
birdsong-recognition.zip	 example_test_audio_summary.csv  train_audio
example_test_audio		 sample_submission.csv		 train.csv
example_test_audio_metadata.csv  test.csv
{% endraw %}

Let's take a look at train.csv.

{% raw %}
import pandas as pd

train = pd.read_csv('data/train.csv')
train.shape
(21375, 35)
{% endraw %} {% raw %}
train.head()
rating playback_used ebird_code channels date pitch duration filename speed species ... xc_id url country author primary_label longitude length time recordist license
0 3.5 no aldfly 1 (mono) 2013-05-25 Not specified 25 XC134874.mp3 Not specified Alder Flycatcher ... 134874 https://www.xeno-canto.org/134874 United States Jonathon Jongsma Empidonax alnorum_Alder Flycatcher -92.962 Not specified 8:00 Jonathon Jongsma Creative Commons Attribution-ShareAlike 3.0
1 4.0 no aldfly 2 (stereo) 2013-05-27 both 36 XC135454.mp3 both Alder Flycatcher ... 135454 https://www.xeno-canto.org/135454 United States Mike Nelson Empidonax alnorum_Alder Flycatcher -82.1106 0-3(s) 08:30 Mike Nelson Creative Commons Attribution-NonCommercial-Sha...
2 4.0 no aldfly 2 (stereo) 2013-05-27 both 39 XC135455.mp3 both Alder Flycatcher ... 135455 https://www.xeno-canto.org/135455 United States Mike Nelson Empidonax alnorum_Alder Flycatcher -82.1106 0-3(s) 08:30 Mike Nelson Creative Commons Attribution-NonCommercial-Sha...
3 3.5 no aldfly 2 (stereo) 2013-05-27 both 33 XC135456.mp3 both Alder Flycatcher ... 135456 https://www.xeno-canto.org/135456 United States Mike Nelson Empidonax alnorum_Alder Flycatcher -82.1106 0-3(s) 08:30 Mike Nelson Creative Commons Attribution-NonCommercial-Sha...
4 4.0 no aldfly 2 (stereo) 2013-05-27 both 36 XC135457.mp3 level Alder Flycatcher ... 135457 https://www.xeno-canto.org/135457 United States Mike Nelson Empidonax alnorum_Alder Flycatcher -82.1106 0-3(s) 08:30 Mike Nelson Creative Commons Attribution-NonCommercial-Sha...

5 rows × 35 columns

{% endraw %} {% raw %}
classes = train.ebird_code.unique().tolist()
len(classes)
264
{% endraw %}

We can listen to the longest recording in the train set:

{% raw %}
from IPython.lib.display import Audio

Audio('data/train_audio/comrav/XC246425.mp3')
{% endraw %}

How many different birds are there in the train set?

{% raw %}
train.ebird_code.nunique()
264
{% endraw %}

This is what sampling rates across the recordings look like:

{% raw %}
train.sampling_rate.value_counts()
44100 (Hz)    12693
48000 (Hz)     8373
22050 (Hz)      123
32000 (Hz)       93
24000 (Hz)       54
16000 (Hz)       34
11025 (Hz)        3
8000 (Hz)         2
Name: sampling_rate, dtype: int64
{% endraw %}

One of the organizers shared on Kaggle that all the recordings in the test set should be sampled at 32 kHz.

Let's resample the train set to 32 kHz so that this computationally expensive operation is performed only once, before training.

!!! WARNING !!!

The code below was run on a 96 vCPU VM - it might take a really long time on a standard machine. To save you the hassle, I uploaded the transformed data to GCP storage here. The files are zipped, so you will have to extract them after the download completes.

{% raw %}
mkdir data/train_resampled
{% endraw %} {% raw %}
import os
from pathlib import Path

# Create one output directory per species (ebird code); exist_ok makes
# the cell safe to re-run.
for directory in Path('data/train_audio').iterdir():
    ebird_code = directory.name
    os.makedirs(f'data/train_resampled/{ebird_code}', exist_ok=True)
{% endraw %}

While processing the files I found that data/train_audio/lotduc/XC195038.mp3 cannot be read, so let's remove it to avoid errors during resampling.

{% raw %}
Path('data/train_audio/lotduc/XC195038.mp3').unlink()
{% endraw %} {% raw %}
import librosa
import soundfile as sf

SAMPLE_RATE = 32000  # match the test set's 32 kHz sampling rate

def resample_audio(path):
    # librosa resamples on load; mono=True collapses stereo recordings
    x = librosa.load(path, sr=SAMPLE_RATE, mono=True)[0]
    ebird_code = path.parent.name
    sf.write(f'data/train_resampled/{ebird_code}/{path.stem}.wav', x, SAMPLE_RATE)
{% endraw %} {% raw %}
%%time
from multiprocessing import Pool

NUM_WORKERS = os.cpu_count()  # assumption: the value used in the original run isn't shown

for directory in Path('data/train_audio').iterdir():
    file_paths = list(directory.iterdir())
    with Pool(NUM_WORKERS // 2) as p:
        p.map(resample_audio, file_paths)
{% endraw %}

Ok, so we now have all the audio saved in a convenient format and we offloaded the expensive resampling to happen before training. Great!

The big question is - how many recordings do we have per species?

{% raw %}
%%time
from collections import defaultdict

recs = defaultdict(list)
for directory in Path('data/train_resampled').iterdir():
    ebird_code = directory.name
    for file in directory.iterdir():
        recs[ebird_code].append((file, sf.info(file).duration))
CPU times: user 1.88 s, sys: 432 ms, total: 2.31 s
Wall time: 2.44 s
{% endraw %} {% raw %}
counts = [len(recs[ebird]) for ebird in recs.keys()]
min(counts), max(counts)
(9, 100)
{% endraw %} {% raw %}
import matplotlib.pyplot as plt

plt.hist(counts);
{% endraw %} {% raw %}
import numpy as np

(np.array(counts) == 100).sum()
134
{% endraw %}

As it turns out, we have between 9 and 100 recordings per ebird code. Let's construct the train and validation sets taking this into account.

{% raw %}
train, val = {}, {}

for ebird in recs.keys():
    rs = recs[ebird]
    val_count = max(int(len(rs) * 0.1), 1)

    val[ebird] = rs[:val_count]
    train[ebird] = rs[val_count:]
{% endraw %} {% raw %}
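To sanity-check the split rule above - at least one validation recording per class, otherwise roughly 10% - here it is applied to a toy dict (the names and file lists below are illustrative, not the real recordings):

```python
# Toy recordings: one well-represented class and one at the minimum of 9 recs
toy_recs = {
    'aldfly': [f'aldfly_{i}.wav' for i in range(100)],
    'norpar': [f'norpar_{i}.wav' for i in range(9)],
}

toy_train, toy_val = {}, {}
for ebird, rs in toy_recs.items():
    # int() truncates, so small classes would get 0 without the max(..., 1)
    val_count = max(int(len(rs) * 0.1), 1)
    toy_val[ebird] = rs[:val_count]
    toy_train[ebird] = rs[val_count:]

print(len(toy_val['aldfly']), len(toy_train['aldfly']))  # 10 90
print(len(toy_val['norpar']), len(toy_train['norpar']))  # 1 8
```

Note that the recordings are not shuffled before splitting, so the validation set is always the first few files of each class.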
{% endraw %} {% raw %}

class AudioDataset[source]

AudioDataset(recs, classes, len_mult=20, mean=None, std=None) :: Dataset

An abstract class representing a `Dataset`.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite `__getitem__`, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite `__len__`, which is expected to return the size of the dataset by many `Sampler` implementations and the default options of `DataLoader`.

Note: `DataLoader` by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

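The actual `AudioDataset` implementation isn't shown on this page, but a map-style dataset boils down to the two methods above. A rough sketch of the shape such a class might take (the class name, sampling strategy, and returned values here are illustrative assumptions, not the real code):

```python
import random

class AudioDatasetSketch:
    """Map-style dataset sketch: __getitem__ + __len__ (illustrative only)."""

    def __init__(self, recs, classes, len_mult=20):
        self.recs = recs          # dict: ebird_code -> list of (path, duration)
        self.classes = classes    # list of ebird codes; list index = label
        self.len_mult = len_mult  # samples per class per "epoch"

    def __len__(self):
        # one pass visits each class len_mult times
        return len(self.classes) * self.len_mult

    def __getitem__(self, idx):
        # map the flat index back to a class, then pick a random recording of it
        class_idx = idx % len(self.classes)
        ebird_code = self.classes[class_idx]
        path, duration = random.choice(self.recs[ebird_code])
        return path, class_idx
```

This kind of index-to-class mapping is consistent with the `264*20` iteration used below to sample statistics from `train_ds`.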
{% endraw %} {% raw %}
pd.to_pickle(classes, 'data/classes.pkl')
pd.to_pickle(train, 'data/train_set.pkl')
pd.to_pickle(val, 'data/val_set.pkl')
pd.to_pickle(recs, 'data/recs.pkl')
{% endraw %} {% raw %}
classes = pd.read_pickle('data/classes.pkl')
train_ds = AudioDataset(pd.read_pickle('data/train_set.pkl'), classes)
valid_ds = AudioDataset(pd.read_pickle('data/val_set.pkl'), classes, len_mult=10)
{% endraw %} {% raw %}
%%time
x = []
for i in range(264*20):
    x.append(train_ds[i][0])
CPU times: user 3.01 s, sys: 1.24 s, total: 4.25 s
Wall time: 17.6 s
{% endraw %} {% raw %}
np.stack(x).mean(), np.stack(x).std()
(1.6537946e-05, 0.04572224)
{% endraw %}
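These statistics are presumably what the `mean` and `std` arguments of `AudioDataset` are for: standardizing each sample with the train set's statistics. A minimal sketch of the normalization itself (the constants come from the run above):

```python
import numpy as np

MEAN, STD = 1.6537946e-05, 0.04572224  # computed over the sampled batch above

def normalize(x, mean=MEAN, std=STD):
    # standard score: zero mean, unit variance under the train statistics
    return (x - mean) / std

# Synthetic audio-shaped input just to demonstrate the call
x = np.random.randn(4, 100).astype(np.float32) * STD + MEAN
x_norm = normalize(x)
```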

Seems we have everything we need now to start training!

(One could object that a non-deterministic validation set is generally not a good idea, but I think we can live with it for now.)